视觉语言导航(VLN)任务要求代理商通过自然语言指令的指导到达目标。以前的作品学会在指令后逐步导航。然而,这些作品可能无法歧视跨指令轨迹对的相似性和差异,并忽略子指令的时间连续性。这些问题妨碍了代理人学习独特的视觉和语言表示,损害了导航政策的稳健性和普遍性。在本文中,我们提出了一种对比的指令轨迹学习(Citl)框架,探讨了不同数据样本的不变性,而不同的数据样本和方差以学习强大导航的独特表示。具体而言,我们提出:(1)通过分别对比完整轨迹观测和指示的语义来提高视觉和语言表示来提高视觉和语言。 (2)细粒度对比学学习目的,通过利用子指示的时间信息来感知指示; (3)对矿井硬样品对比学学习的成对采样重量机制,从而减轻了数据采样偏差在对比学习中的影响。我们的Citl可以轻松地与VLN骨干网集成,形成新的学习范例,并在看不见的环境中实现更好的普遍性。广泛的实验表明,Citl的模型超越了R2R,R4R和RXR上以前的最先进的方法。
translated by 谷歌翻译
在本文中,我们研究了通过减少优化难度来改善对抗性训练(AT)获得的对抗性鲁棒性。为了更好地研究这个问题,我们为AT建立了一个新颖的Bregman Divergence观点,其中可以将其视为负熵曲线上训练数据点的滑动过程。基于这个观点,我们分析了方法(即PGD-AT和Trades)的两个典型方法的学习目标,并且我们发现交易的优化过程比PGD-AT更容易,而PGD-AT则将PGD-AT分开。此外,我们讨论了熵在贸易中的功能,我们发现具有高熵的模型可以是更好的鲁棒性学习者。受到上述发现的启发,我们提出了两种方法,即伪造和MER,它们不仅可以减少10步PGD对手下优化的难度,而且还可以提供更好的鲁棒性。我们的工作表明,在10步PGD对手下减少优化的难度是增强AT中对抗性鲁棒性的一种有前途的方法。
translated by 谷歌翻译
在线算法是算法设计中的重要分支。设计具有有界竞争比率的在线算法(在最坏情况性能方面)可能是艰难的并且通常依赖于特定于问题的假设。由生成对抗净净净(GAN)的对抗训练的启发和在线算法的竞争比率基于最坏情况的输入,我们采用深度神经网络来学习从头开始进行资源分配和定价问题的在线算法对于最坏情况的输入,可以最小化离线最佳和学习的在线算法之间的性能差距的目标。具体而言,我们分别利用两个神经网络作为算法和对手,让他们播放零和游戏,而对验证负责产生最坏情况的输入,而算法基于对手提供的输入学习最佳策略。为了确保算法网络的更好收敛(到所需的在线算法),我们提出了一种新颖的每轮更新方法来处理顺序决策,以便在不同的回合中断复杂依赖性,以便可以为每种可能的动作完成更新,而不是只有采样的行动。据我们所知,我们的作品是首次使用深度神经网络来设计一个在最坏情况性能保证的角度的在线算法。实证研究表明,我们的更新方法确保了纳什均衡的融合,并且学习算法在各种设置下优于最先进的在线算法。
translated by 谷歌翻译
卷积神经网络(CNNS)用于许多现实世界应用,例如基于视觉的自主驾驶和视频内容分析。要在各种目标设备上运行CNN推断,硬件感知神经结构搜索(NAS)至关重要。有效的硬件感知NAS的关键要求是对推理延迟的快速评估,以便对不同的架构进行排名。在构建每个目标设备的延迟预测器的同时,在本领域中通常使用,这是一个非常耗时的过程,在极定的设备存在下缺乏可扩展性。在这项工作中,我们通过利用延迟单调性来解决可扩展性挑战 - 不同设备上的架构延迟排名通常相关。当存在强烈的延迟单调性时,我们可以重复使用在新目标设备上搜索一个代理设备的架构,而不会丢失最佳状态。在没有强烈的延迟单调性的情况下,我们提出了一种有效的代理适应技术,以显着提高延迟单调性。最后,我们验证了我们的方法,并在多个主流搜索空间上使用不同平台的设备进行实验,包括MobileNet-V2,MobileNet-V3,NAS-Bench-201,Proxylessnas和FBNet。我们的结果突出显示,通过仅使用一个代理设备,我们可以找到几乎与现有的每个设备NAS相同的帕累托最优架构,同时避免为每个设备构建延迟预测器的禁止成本。 github:https://github.com/ren-research/oneproxy.
translated by 谷歌翻译
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
translated by 谷歌翻译
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
translated by 谷歌翻译
Image Virtual try-on aims at replacing the cloth on a personal image with a garment image (in-shop clothes), which has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images, however, occlusion remains a pernicious effect for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two aspects: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth warps to the unreasonable body part. Based on the in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which can generate semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts a sharpened semantic parsing on the try-on person. Aided by semantics guidance and pose prior, various complexities of texture are selectively blending with human parts in a copy-and-paste manner. Then, the Generative Module (GM) is utilized to take charge of synthesizing the final try-on image and learning to de-occlusion jointly. In comparison to the state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.
translated by 谷歌翻译
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separated approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find the previous metric PartPQ biases to PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part feature and things/stuff feature, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Secondly, we propose a new metric Part-Whole Quality (PWQ) to better measure such task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross attention scheme to further boost part segmentation qualities. We design a new part-whole interaction method using masked cross attention. Finally, the extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
translated by 谷歌翻译
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
translated by 谷歌翻译
This paper illustrates the technologies of user next intent prediction with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. Specifically, we propose AlipayKG to explicitly characterize user intent, which is an offline concept knowledge graph in the Life-Service domain modeling the historical behaviors of users, the rich content interacted by users and the relations between them. We further introduce a Transformer-based model which integrates expert rules from the knowledge graph to infer the online user's next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of the downstream tasks while retaining explainability.
translated by 谷歌翻译